TB-DROP: deep learning-based drug resistance prediction of Mycobacterium tuberculosis utilizing whole genome mutations

The most widely practiced strategy for constructing the deep learning (DL) prediction model for drug resistance of Mycobacterium tuberculosis (MTB) involves the adoption of ready-made and state-of-the-art architectures usually proposed for non-biological problems. However, the ultimate goal is to construct a customized model for predicting the drug resistance of MTB and eventually for the biological phenotypes based on genotypes. Here, we constructed a DL training framework to standardize and modularize each step during the training process using the latest tensorflow 2 API. A systematic and comprehensive evaluation of each module in the three currently representative models, including Convolutional Neural Network, Denoising Autoencoder, and Wide & Deep, which were adopted by CNNGWP, DeepAMR, and WDNN, respectively, was performed in this framework regarding module contributions in order to assemble a novel model with proper dedicated modules. Based on the whole-genome level mutations, a de novo learning method was developed to overcome the intrinsic limitations of previous models that rely on known drug resistance-associated loci. A customized DL model with the multilayer perceptron architecture was constructed and achieved a competitive performance (the mean sensitivity and specificity were 0.90 and 0.87, respectively) compared to previous ones. The new model developed was applied in an end-to-end user-friendly graphical tool named TB-DROP (TuBerculosis Drug Resistance Optimal Prediction: https://github.com/nottwy/TB-DROP), in which users only provide sequencing data and TB-DROP will complete analysis within several minutes for one sample. Our study contributes to both a new strategy of model construction and clinical application of deep learning-based drug-resistance prediction methods. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10066-y.


Introduction
Tuberculosis, caused by Mycobacterium tuberculosis (MTB), is a serious public health problem worldwide. According to the Global Tuberculosis Report 2022 [1], 6.4 million patients were newly diagnosed with tuberculosis, among whom about 1.4 million HIV-negative patients and 187,000 HIV-positive patients died in 2021. Tuberculosis is also a high health burden in China [2]. The emergence of drug-resistant MTB has posed a severe challenge to global tuberculosis prevention and treatment. Drug resistance is traditionally diagnosed using culture-based antimicrobial susceptibility testing. However, this approach is relatively slow and expensive. Further, it has inherent inaccuracies and issues with reproducibility [3]. One critical challenge in tackling the global TB epidemic is timely diagnosis and correct treatment. Rapid molecular diagnostic tests can promote early detection and prompt treatment [4]. As drug resistance of MTB is mainly conferred by nucleotide variations in genes encoding drug targets or drug-converting enzymes [5], molecular detection of mutations can be used for quick detection and for guiding treatment of drug-resistant MTB.
Currently, sequencing technology is being used for predicting drug susceptibility, as it provides a wide range of information on mutations [6,7]. These methods can be divided into two categories. 1) The direct association (DA) method identifies known mutations related to drug resistance from whole genome sequencing (WGS) data; examples include KvarQ [8], CASTB [9], MyKrobe Predictor TB [10], PhyResSE [11], TGS-TB [12], TBProfiler [5], and SAM-TB [13]. These tools rely heavily on the library of identified resistance-related sites and have many limitations. Missing nucleotide calls for these mutations, or mutations in drug resistance genes whose association with resistance is unknown, may directly lead to prediction failure. For the four first-line anti-tuberculosis drugs, these problems would lead to no prediction for 4.7-10.2% of isolates [14]. The unknown resistance mechanisms of most non-first-line drugs would lead to low prediction accuracy [15]. In addition, DA cannot model gene-gene interactions, and its prediction performance decreases as the MDR (multi-drug resistant) rate increases [16]. 2) Machine learning uses sequencing data and drug susceptibility test (DST) data to establish a predictive model for biological phenotypes, including drug resistance. Various machine learning algorithms have been used for MTB drug resistance analysis, such as logistic regression [17], random forest [18], decision tree [19] and gradient boosting tree [20]. As a branch of machine learning, deep learning is now being widely applied for predicting biological phenotypes based on genomic mutations. Waldmann et al. [21] designed a convolutional neural network, CNNGWP, for genome-wide prediction. Bellot et al. [22] evaluated the performance of the convolutional neural network (CNN) and the multilayer perceptron (MLP) for predicting five complex human phenotypes.
Regarding tuberculosis, Chen et al. [16] assessed the ability of a Wide & Deep model, which was originally designed for recommender systems, to predict MTB drug resistance using 3,601 MTB strains and their 222 single nucleotide polymorphisms (SNPs). The Wide & Deep model outperformed existing approaches based on DA and previously reported machine learning models. Yang et al. proposed two DL-based models: a denoising auto-encoder [23] and a heterogeneous graph attention network [24]. Jiang et al. considered drug resistance prediction as a document classification problem and constructed a hierarchical attentive neural network model inspired by natural language processing [25]. Anna et al. divided drug resistance prediction into multi-drug resistance prediction and single-drug resistance prediction and constructed a CNN-based model for each of them [26]. ML-based methods can establish the mapping relationship between mutations and DST data through de novo learning without prior biological knowledge and can adapt to ever-increasing biological data.
As described above, an abundance of machine learning models have been adopted to predict MTB drug resistance. Therefore, researchers have started to summarize methods [27] and consider how to apply them in clinical practice [28]. Here, we continue this work with concrete code and practice: the characteristics of the WDNN, DeepAMR, and CNNGWP (the representative CNN model) models were described comprehensively, and the models were reimplemented for benchmarking. This helps researchers evaluate the performance of each model accurately and objectively, building a foundation for improving existing models and designing better ones. The benchmarked models were further customized and fine-tuned to develop a deep learning MTB drug resistance tool with whole genome mutations as input data. Hyperparameters and architectures of the models were customized for whole genome mutations. genTB, the only available deep learning-based tool, can be easily used by clinicians [29]. However, genTB is based on known drug resistance-associated loci and is not applicable to patients harboring novel loci. Therefore, a model utilizing whole genome mutations was constructed and used in TB-DROP (https://github.com/nottwy/TB-DROP).
This study aims to construct a customized deep learning-based model for predicting the drug resistance of MTB using whole genome mutations and to bring it to clinicians through a user-friendly tool, TB-DROP. The strategy we adopted to construct our model differs from the currently most widely used strategy, and it should contribute to the construction of models suited to predicting biological phenotypes from genotypes. Whole genome mutations were used as the input of our model and supported our de novo learning strategy, which does not rely on a library of known drug resistance mutations.

Training and evaluation of models on our dataset
Figure 1 summarizes the phenotypes of the 12,478 MTB strains available for analysis. After variant calling and filtering, 620,169 variants were retained for drug resistance prediction. For every drug, there were fewer drug-resistant samples than drug-susceptible samples.
Ten-fold (10×) cross-validation with the multilabel stratified shuffle split (MSSS) sample split strategy was performed to measure the performance of each model. The mean and variance (Table 1) of AUC, sensitivity, specificity, precision, and negative predictive value (NPV) were then calculated to measure the average performance and the stability of each model. The loss curves of each model are also presented to show its training behavior (Fig. 2); they reflect whether a model is reliable and are an important metric of a model (the separate loss curve of each model can be found in Additional file 2). All 10 loss curves from the 10× cross-validation were checked for each model and found to be similar. Hence, a representative one was selected and presented here. The loss curves of all models declined steadily and finally reached a plateau, except for DeepAMR. For DeepAMR, many hyperparameters were tried, but its validation loss curve remained U-shaped, indicative of overfitting.
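The per-fold aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the fold counts are hypothetical, and the use of population variance is an assumption, since the text does not state which variance estimator was used.

```python
from statistics import mean, pvariance


def confusion_metrics(tp, tn, fp, fn):
    """Threshold-dependent metrics, using the definitions given with Table 1."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def summarize_folds(fold_counts):
    """Mean and variance of each metric over cross-validation folds.

    fold_counts: list of (tp, tn, fp, fn) tuples, one per fold.
    Population variance is an assumption (see lead-in).
    """
    per_fold = [confusion_metrics(*counts) for counts in fold_counts]
    summary = {}
    for name in per_fold[0]:
        values = [metrics[name] for metrics in per_fold]
        summary[name] = (mean(values), pvariance(values))
    return summary


# Hypothetical counts for three folds (illustration only, not study data).
folds = [(90, 870, 130, 10), (88, 880, 120, 12), (92, 860, 140, 8)]
summary = summarize_folds(folds)
for name, (avg, var) in summary.items():
    print(f"{name}: mean={avg:.3f} variance={var:.6f}")
```

A small variance across folds is what the text treats as evidence of a stable model.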

Comparison of models' performance
The MLP-based model was chosen as the representative one and compared with three other machine learning-based models: WDNN [16], DeepAMR [23] and GBT-CRM [20]. The metrics of each model were obtained from the corresponding articles; WDNN and DeepAMR reported only sensitivity, specificity and AUC. Before comparison, two important factors that considerably influence model metrics should be introduced: the total sample size and the test size. The sample size of GBT-CRM was larger than ours, and fewer samples were used in its test set. Large sample sizes allow models to be trained better, and small test sets are less challenging for models. Although GBT-CRM benefited from these two factors, the difference in AUC between it and our model was small (2.1%-5.0%) (Table 2). We expect that the difference between the two models would continue to diminish if the amount of data were identical and the ratio between training and test were 1. The AUC values of all models were above 0.9 (Table 2), with the exception of WDNN's 0.883 for PZA, which indicated that DeepAMR, GBT-CRM and TB-DROP had good and stable performance. According to AUC, DeepAMR had the best performance. It is noteworthy that DeepAMR was based on known mutations related to drug resistance and cannot work well when encountering novel mutations and unknown resistance mechanisms.
The values of the other metrics were influenced by the choice of threshold distinguishing drug-resistant isolates from drug-susceptible isolates. They were therefore comparable only to a certain extent. Our model paid more attention to the sensitivity for drug-resistant isolates and the NPV for drug-susceptible isolates: high sensitivity ensured that fewer drug-resistant isolates were predicted as drug-susceptible, while a high NPV ensured that the antimicrobial drugs chosen to treat patients were likely to work. For RIF, EMB, and PZA, our model performed better than the GBT-CRM model in terms of sensitivity and NPV (Table 2). Regarding sensitivity, DeepAMR and our model were stable at over 85%, while the WDNN model achieved only 75% for PZA (Table 2). All these results indicated that de novo drug resistance prediction based on a deep learning model utilizing whole genome mutations was competitive with previous models and could deliver better drug resistance prediction than deep learning models based on known drug resistance genes and machine learning models based on whole genome mutations.
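The dependence of these metrics on the decision threshold can be illustrated with a short sketch. The scores, labels, and thresholds below are hypothetical and serve only to show the trade-off; they are not values from Table 2.

```python
def sens_spec_at(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called resistant.

    labels: 1 = resistant (positive), 0 = susceptible (negative).
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities (illustration only).
scores = [0.95, 0.80, 0.40, 0.30, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

low = sens_spec_at(scores, labels, 0.25)   # favors sensitivity
high = sens_spec_at(scores, labels, 0.50)  # favors specificity
print(low, high)
```

Lowering the threshold raises sensitivity at the cost of specificity, which is why threshold-dependent metrics from different studies are only partly comparable.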

TB-DROP: MTB Drug Resistance Optimal Predictor
The trained deep learning model and the variant calling pipeline were used in a Docker environment. The workflow and user interface of TB-DROP are shown in Figs. 3 and 4. In the TB-DROP interface, users only need to perform a simple two-step operation to obtain the drug resistance status of MTB: upload the sequencing data in fastq format and click "Start Analysis". The whole analysis takes less than 20 min per sample on a computer with an AMD Ryzen 5 2600 six-core processor and 16 GB of RAM. For samples that have already been analyzed, the prediction result is presented when users click the Sample_ID.

Discussion
Laboratory mislabeling of the drug resistance status of MTB should be excluded. A reliable standard for removing laboratory mislabeling involves removal of isolates whose phenotypes are discordant with their genotypes, for example, an isolate recorded as susceptible but harboring high-level resistance mutations. One main aim of this study was to evaluate the potential of neural networks for predicting de novo drug resistance by utilizing whole genome mutations. Therefore, only first-line drugs, which have larger sample sizes than second-line drugs, were evaluated in this study. It is reasonable to infer that the performance of DL models would be similar for second-line drugs once their sample sizes reach similar levels.
The three models used in this study were not designed specifically for predicting phenotypes from genotypes. The CNN, whose architecture was inspired by the human visual system, was proposed by LeCun et al. [30] for recognizing handwritten zip codes. WDNN [31] was developed for constructing recommender systems, in which the original inputs were user and contextual information and the desired output was relevant items that users might be interested in. The successful application of these models to biological phenotypes only reflects that there are similar relationships between inputs and outputs. An architecture designed specifically according to biological genotype-phenotype relationships is in demand [22]. The first step toward this goal is to evaluate the performance of different architectures objectively. Nevertheless, owing to the complexity of deep neural architectures and of MTB drug resistance, a comprehensive benchmark of these models has not yet been performed, which prevents researchers from developing the most suitable model and enhancing prediction performance [32].
The relatively low PPV for predicting EMB and PZA drug resistance (0.524 for EMB and 0.410 for PZA) largely came from the imbalanced dataset (Table 2), in which the ratio of positive to negative samples was far from 1 (0.184 for EMB and 0.139 for PZA). When the ratio was 1, the PPV was around 0.85 (assuming the numbers of positive and negative samples were both 2,000 and the sensitivity and specificity did not change, the PPVs were 0.857 and 0.831, respectively). The low PPV therefore does not reflect low sensitivity, and most patients carrying drug-resistant MTB, including MTB resistant to EMB and PZA, should be detected correctly using our prediction tool.
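The effect of class imbalance on PPV can be reproduced with a short calculation. The sketch below is illustrative only: it uses the mean sensitivity and specificity quoted in the abstract (0.90 and 0.87) rather than the per-drug values of Table 2, so its results differ from the PPVs reported above. The function name is hypothetical.

```python
def ppv(sensitivity, specificity, pos_to_neg_ratio):
    """Positive predictive value for a given positive:negative sample ratio.

    With N negatives and r * N positives:
    PPV = sens * r / (sens * r + (1 - spec)).
    """
    true_pos = sensitivity * pos_to_neg_ratio
    false_pos = 1.0 - specificity
    return true_pos / (true_pos + false_pos)


# Mean values from the abstract, used here as stand-ins (assumption).
SENS, SPEC = 0.90, 0.87

for label, ratio in [("balanced", 1.0), ("EMB-like", 0.184), ("PZA-like", 0.139)]:
    print(f"{label}: ratio={ratio} PPV={ppv(SENS, SPEC, ratio):.3f}")
```

With sensitivity and specificity held fixed, PPV falls as positives become rarer, which is the behavior the paragraph describes.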
Several methods can be used to improve the performance of deep learning models. First, collection of reliable data is critical. Currently, the input features are encoded as 0 and 1, where 0 represents no mutation and 1 represents a mutation, although there are other commonly used representations, such as "012" (where 0 represents no mutation, 1 a heterozygous genotype, and 2 a homozygous alternative genotype) and one-hot encoding. Further evaluation is required to assess their suitability. Each model evaluated in this study had its own specific preprocessing steps. However, we have not evaluated the performance of all these preprocessing methods. In addition, correct conclusions can only be obtained from the correct combination of all components, from input representation to model architecture and hyperparameters. An unsuitable combination may affect the function of individual components. A model architecture more suitable for genomic mutations and the prediction of biological phenotypes can be designed with reference to the following perspectives: 1) determining the functions of various types of mutations in the genome, including the relationships among mutations and between mutations and phenotypes [33], and then designing an architecture that can represent such relationships; 2) multi-task or single-task learning. A multitask neural network updates its weights according to the total loss of all tasks, whereas a single-task neural network learns the weights for a specific drug. The multitask neural network can learn from all labels and hence has more samples. However, different labels may conflict with each other, lead to wrong updates of the network weights, and finally perturb the metrics of the model. The bias of the final layer of the MLP model was updated in an unexpected manner, which might be caused by the multitask architecture; 3) the ideal solution for finding the best hyperparameters would be to traverse all hyperparameter combinations and then select the optimal one. However, this would require a considerable amount of computational resources and time. One main aim of this study was to evaluate each module of multiple representative models and to assemble a new model based on the contribution of each module. In the future, we will improve the hyperparameter tuning strategy; 4) in addition to the three representative deep learning models used here, natural language processing models can also be applied to drug resistance prediction, if we consider the whole genome mutations as a document and the resistant and susceptible phenotypes as two types of document.
The modules and learning processes of existing models were proposed for their own aims. Therefore, we need to deepen our understanding of how genotypes determine phenotypes and reveal the mathematical functions and learning processes that can truly characterize this mechanism. In this way, we can move from simply borrowing modules from existing models to proposing truly suitable modules and learning processes. As a tool positioned for use in clinical practice, TB-DROP also requires continuous accumulation of experience and updating of models to deal with problems arising in practice.

Phenotypic and sequencing data
The datasets were obtained from two previously published studies and consist of 12,478 isolates with WGS data and phenotypic DST data [14,34]. The SRA accession numbers of all raw datasets are listed in Additional file 3. The phenotype data included resistance status for four first-line drugs (rifampicin, isoniazid, pyrazinamide, and ethambutol). Phenotypic data were classified as resistant, susceptible, or unknown.

Building the predictor sets of features
The features used for prediction were classified into two groups. In one group, each mutation in the genome was used as a predictive feature. The presence of a mutation in the isolate was represented by a binary variable, with 1 indicating the presence of the mutation and 0 indicating its absence. In the other group, to reduce the feature dimension, we used 100-bp windows to divide the entire genome into 44,116 regions, and the number of mutations in each region was considered the predictor.
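The two feature groups can be sketched as follows. This is a minimal illustration under stated assumptions: the genome length is taken to be that of the H37Rv reference (4,411,532 bp), which yields the 44,116 windows mentioned above when divided into 100-bp windows; the reference actually used, the variant identifiers, and the function names are assumptions for illustration.

```python
import math

GENOME_LENGTH = 4_411_532  # assumption: H37Rv reference length in bp
WINDOW = 100               # window size stated in the text


def binary_features(sample_variants, all_variants):
    """1 if the isolate carries the variant, else 0, in a fixed variant order."""
    carried = set(sample_variants)
    return [1 if variant in carried else 0 for variant in all_variants]


def window_counts(positions, genome_length=GENOME_LENGTH, window=WINDOW):
    """Number of mutations in each consecutive window (1-based positions)."""
    counts = [0] * math.ceil(genome_length / window)
    for pos in positions:
        counts[(pos - 1) // window] += 1
    return counts


# Hypothetical variant catalogue and isolate (illustration only).
catalogue = [(761155, "C", "T"), (2155168, "C", "G"), (2155170, "G", "A")]
isolate = [(2155168, "C", "G"), (2155170, "G", "A")]

bits = binary_features(isolate, catalogue)
counts = window_counts([pos for pos, _, _ in isolate])
print(bits, len(counts), counts[21551])
```

The windowed representation trades positional resolution for a much smaller input layer than one feature per variant.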

Designing and training the TB-DROP model
Both designing novel models and comprehensively comparing existing methods are essential for developing an efficient neural network for MTB drug resistance. Multiple neural networks have been used to predict biological phenotypes, including MTB drug resistance. To use the best neural network in our tool, four architectures were summarized, reimplemented, optimized, and compared.
First, we summarized the features, advantages, and disadvantages of each model, with neural network designers as the main intended audience. A comprehensive and in-depth summary may facilitate the construction of more suitable models. Next, we reimplemented WDNN, DeepAMR, and CNNGWP in the same framework according to their published source code. Using the same framework guaranteed that each model could be described systematically. In this way, we could clearly determine the modules used by each model. In addition, it was convenient to append a module to a model at the right place. The reimplemented version of each model was evaluated with the datasets distributed with the source code to verify that we had restored the model to a reasonable extent.
Each model was optimized to accommodate whole genome mutations as inputs, retaining the advantages of each model while overcoming its defects. Our dataset was used as the input of each model, which was optimized according to its training/validation loss curves and its metrics on the validation dataset. The hyperparameters were tuned according to their functions and the depth up to which the models achieved satisfactory generalization. For all models, the weights of the neural network with the lowest validation loss during the training process were used in the final model.
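Selecting the weights with the lowest validation loss can be sketched as below. This is an illustrative outline rather than the authors' training loop: `run_epoch` is a hypothetical callable standing in for one epoch of training plus validation. In Keras, the `ModelCheckpoint` callback with `monitor="val_loss"` and `save_best_only=True` offers comparable behavior; whether the authors used it is not stated in the text.

```python
def train_with_best_checkpoint(run_epoch, n_epochs):
    """Keep the weights from the epoch with the lowest validation loss.

    run_epoch: hypothetical callable; run_epoch(epoch) trains one epoch
    and returns (weights, validation_loss).
    """
    best_loss, best_weights, best_epoch = float("inf"), None, None
    for epoch in range(n_epochs):
        weights, val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_weights, best_epoch = val_loss, weights, epoch
    return best_weights, best_loss, best_epoch


# Toy U-shaped validation curve, as seen when a model starts to overfit.
val_curve = [0.70, 0.52, 0.41, 0.38, 0.44, 0.58]


def fake_epoch(epoch):
    # Stand-in for real training: returns a label in place of real weights.
    return f"weights@{epoch}", val_curve[epoch]


weights, loss, epoch = train_with_best_checkpoint(fake_epoch, len(val_curve))
print(weights, loss, epoch)
```

With a U-shaped validation curve, this rule returns the weights from the bottom of the curve rather than from the final epoch.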
Finally, the optimized models were evaluated on the test dataset, and the model with the best performance was selected as the representative model and compared with other MTB drug resistance prediction

TB-DROP graphical user interface
TB-DROP is based on Docker (https://www.docker.com/), a virtualization strategy that is platform-agnostic owing to its use of containers. Containers can be consistently interchanged and used in different computing environments, irrespective of differences in user hardware and/or operating systems. These features of containers ensure replicability and reproducibility of data analyses across different facilities [40].
TB-DROP is an automated, easy-to-use, web-based GUI tool deposited at GitHub (https://github.com/nottwy/TB-DROP). During development, the bioinformatics pipeline, the trained deep learning model, and the user interface were packaged in a custom Docker image, and the characteristics of Docker were utilized to make the tool compatible across platforms, including Windows, Linux, and macOS. The resistance results can be visualized directly in a browser.
Users need to upload the sequencing data on the webpage and initiate the analysis.The result of drug resistance is usually returned in a few minutes.

Summarization of MTB-related and phenotype-related DL models
Each model has its own advantages and unique features, as well as limitations. Learning extensively from the advantages of each model, and then avoiding the shortcomings in their designs, will assist in designing better models. The characteristics of each model are shown in Table 3. The wide and deep architecture enables WDNN to consider additive effects and interactions between mutations simultaneously [16,31]. WDNN utilized an alpha value as the class weight to increase the weight of the class with the smaller sample size.
In the definition of alpha, n1 was the number of drug-resistant MTB and n2 was the number of drug-susceptible MTB. MTB whose resistance status was missing were not considered. The custom loss function was a class-weight binary cross entropy in which n was the total number of drugs, m represented the number of isolates with resistance status for each drug, P_true,ij indicated the true probability of the j-th MTB being resistant to the i-th drug (1.0 for resistant MTB and 0.0 for susceptible MTB), and P_pred,ij indicated the predicted probability of the j-th MTB being resistant to the i-th drug. Since WDNN was a multi-task model that had only one loss for all drugs, a double summation (∑∑) over drugs and isolates was used to add up the class-weight binary cross entropy of each sample.
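The exact expressions for alpha and for the loss are not reproduced in this text, so the following is only a sketch of one common form of class-weighted, masked binary cross entropy that matches the verbal description. The alpha formula (the susceptible fraction, so that the rarer resistant class receives more weight), the absence of any normalization, and all function names are assumptions.

```python
import math


def class_weight_alpha(n_resistant, n_susceptible):
    """Assumed class weight for the resistant class: the susceptible fraction."""
    return n_susceptible / (n_resistant + n_susceptible)


def weighted_masked_bce(y_true, y_pred, alphas, eps=1e-7):
    """Sum of class-weighted binary cross entropy over drugs i and isolates j.

    y_true[i][j]: 1.0 resistant, 0.0 susceptible, None if status is missing.
    y_pred[i][j]: predicted probability of resistance.
    alphas[i]: class weight for drug i.
    """
    total = 0.0
    for truths, preds, alpha in zip(y_true, y_pred, alphas):
        for t, p in zip(truths, preds):
            if t is None:  # isolates without a phenotype do not contribute
                continue
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= alpha * t * math.log(p)
            total -= (1.0 - alpha) * (1.0 - t) * math.log(1.0 - p)
    return total


# One drug, three isolates, the last with unknown status (illustration only).
alpha = class_weight_alpha(n_resistant=1, n_susceptible=3)
loss = weighted_masked_bce([[1.0, 0.0, None]], [[0.5, 0.5, 0.9]], [alpha])
print(alpha, loss)
```

Because a single scalar is summed over all drugs, every drug's labels influence the shared weights, which is the multi-task behavior the paragraph describes.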
In addition to predicting resistance to individual drugs, DeepAMR also predicted MDR-TB (multi-drug resistant Mycobacterium tuberculosis) and PANS-TB (MTB that is susceptible to all four first-line drugs: isoniazid (INH), rifampicin (RIF), ethambutol (EMB), and pyrazinamide (PZA)). This further classification of drug resistance phenotypes reminds us that the mechanism controlling drug resistance might change as MTB develops from single-drug resistance to multidrug resistance. Manzour et al. [41] reported that mutations in katG315 appeared more frequently in MDR-TB and that mutations in the inhA promoter appeared more frequently in single-drug resistant MTB. Sintchenko et al. [42] observed that mutations in rpoB existed in both RIF-resistant MTB and RIF-susceptible MTB. Regarding the model architecture, DeepAMR used a denoising autoencoder, which rendered the model more robust to missing genotypes and sequencing errors. Furthermore, the dimension reduction achieved using the autoencoder can considerably reduce the amount of calculation.
Furthermore, DeepAMR used a better sample splitting strategy, multilabel stratified shuffle split (MSSS) [43], than WDNN's naive KFold (provided in sklearn), which does not consider group information. The drawback of KFold is that random splitting of samples can produce a train/test split in which the training group or the test group does not contain all categories for one or more labels in a multi-label setting. If this happens in the training dataset, training fails because some categories are missing. If it happens in the test dataset, some metrics cannot be calculated (e.g., missing positive samples make sensitivity undefined, because sensitivity equals "true positive / all positive samples").
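The failure mode of a naive split can be checked directly. The sketch below does not implement MSSS; it only detects the problem that a stratified split is meant to prevent. The label matrix, index lists, and function name are hypothetical.

```python
def labels_missing_a_class(label_matrix, indices):
    """Return the label columns that lack positives or negatives in `indices`.

    label_matrix[s][k]: 1 resistant, 0 susceptible, None unknown,
    for sample s and drug k.
    """
    n_labels = len(label_matrix[0])
    missing = []
    for k in range(n_labels):
        seen = {label_matrix[s][k] for s in indices} - {None}
        if seen != {0, 1}:
            missing.append(k)
    return missing


# Hypothetical labels for six isolates and two drugs (illustration only).
labels = [[1, 0], [0, 0], [0, 1], [1, 0], [0, 0], [0, None]]

bad_test = [1, 4, 5]   # no resistant isolate for either drug
good_test = [0, 1, 2]  # both classes present for both drugs

print(labels_missing_a_class(labels, bad_test))
print(labels_missing_a_class(labels, good_test))
```

For any drug flagged on a test split, sensitivity or specificity has a zero denominator and cannot be computed.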
Using SNPs from the whole genome, Waldmann et al. [21] attempted to apply a CNN to predict quantitative traits. The CNN has many advantages in predicting phenotypes from genomic mutations. Small convolutional kernels can capture local signals within the whole genome that might be related to drug resistance while saving computational resources. The pooling layer can make the model robust to small changes in the genome.
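The roles of the convolution and pooling layers can be illustrated with a one-dimensional toy example. This is a sketch under simplifying assumptions: the kernel is fixed by hand (real kernels are learned), the inputs are hypothetical, and the shift was chosen to stay within one pooling window; a shift that crosses a window boundary would change the pooled output.

```python
def conv1d(x, kernel):
    """Valid 1D cross-correlation of x with a small kernel (stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


def max_pool1d(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Two hypothetical isolates whose single mutation is shifted by one site.
a = [0, 1, 0, 0, 0, 0, 0, 0]
b = [0, 0, 1, 0, 0, 0, 0, 0]
kernel = [1, 1]  # illustrative fixed kernel; real kernels are learned

pooled_a = max_pool1d(conv1d(a, kernel), 3)
pooled_b = max_pool1d(conv1d(b, kernel), 3)
print(pooled_a, pooled_b)
```

The kernel touches only two neighboring sites at a time, so its cost does not grow with the square of the input size as a fully connected layer's does.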
The MLP is a quintessential neural network model that was proposed a long time ago. A typical MLP consists of an input layer, an output layer, and many hidden layers. These layers are fully connected with each other, because of which the MLP requires many computational resources. The reasons for choosing this model were as follows: (1) it was initially inspired by WDNN; if the structure of the WDNN model is modified by removing its wide part, the rest of the WDNN model is a DNN (or MLP) model; (2) it has the most basic architecture, and its performance should be evaluated before other, more complex architectures are tried; (3) many researchers have started to focus on the MLP again and have proposed many excellent architectures to improve its performance [44]. The final result was that the MLP performed best, before any of the newer advanced MLP models had been tried.

Reimplementation of deep learning models
Although all the models evaluated in this study were implemented with keras (with tensorflow as the backend engine), they were implemented with tensorflow 1.X (WDNN, DeepAMR and CNNGWP) or in R (CNNGWP). Tensorflow 1.X is deprecated, and the syntax changed considerably in tensorflow 2.X. Models implemented in different languages (R and Python tensorflow) are more difficult to use and compare. Therefore, reimplementation was necessary, and all models were reimplemented in Python in the tensorflow 2.X environment. Because we intended to utilize the good design of each published model, make an objective comparison to construct a better MTB drug resistance prediction model, and provide guidelines for optimizing neural network-based phenotype predictors in the future, we implemented all models in the same standard framework and ensured that all modules and parameters were consistent with the original models. We first carefully inspected the source code of each model and implemented each model in strict accordance with it. Next, these models were run in the same framework to ensure that the modules used by each model and their execution order were the same. Finally, we tested the performance of our reimplemented versions and the original source code versions on the dataset accompanying each source code to measure the consistency between the reimplemented and the original models. The outputs of the source code of each model were used as the standard answers.
The outputs of our implementation were compared with the standard answers (Table 4), and the datasets used for each model are introduced below. WDNN's authors provided both the training and test datasets. The training dataset consisted of 3,601 isolates and 6,483 SNPs, and the test dataset consisted of 792 isolates and 222 SNPs. WDNN's source code predicted MTB drug resistance for 11 drugs. Area under the ROC curve (AUC) and area under the precision-recall (PR) curve were chosen as its metrics. Therefore, the difference in the sum of AUC between our reimplemented version and the original version was chosen as the measure of how well we reimplemented the model. The authors of DeepAMR provided a dataset consisting of 8,388 isolates and 5,823 SNPs. It predicted MTB drug resistance for only four drugs. Sensitivity, specificity, AUC, and F1 score were selected as its metrics. Here, the difference in the sum of AUC between our reimplemented version and the original version was also chosen as the measure of how well we reimplemented the model. In the publication of CNNGWP, two datasets, a simulated and a real dataset, were used to evaluate the performance of the CNNGWP model. The real dataset is not accessible. Therefore, only the performance on the simulated dataset was compared. The input consisted of 3,226 samples and 9,723 SNPs. The input values 0, 1, and 2 indicated lower homozygote, heterozygote, and upper homozygote, respectively. The prediction target was a continuous quantitative trait, and the mean squared error (MSE) on the test dataset was used to evaluate the performance of CNNGWP. The sizes of the training and test datasets were 2,326 and 900, respectively. Using the best hyperparameters provided in the article, the MSE on the test dataset was 63.04 in our implementation, which was close to that reported in the original publication (62.34).
The comparison is presented in Table 4, and the performances of our reimplemented versions and the original versions were similar. Therefore, our reimplemented versions represented the original ones.
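The consistency measures described above reduce to simple arithmetic. In the sketch below, the MSE values are those quoted in the text, while the relative-gap measure, the function names, and the AUC lists are assumptions added for illustration; the AUC lists are not values from Table 4.

```python
def relative_gap(reimplemented, original):
    """Absolute difference expressed as a fraction of the original value."""
    return abs(reimplemented - original) / abs(original)


def sum_auc_gap(reimplemented_aucs, original_aucs):
    """Difference in the sum of per-drug AUC between two implementations."""
    return abs(sum(reimplemented_aucs) - sum(original_aucs))


# MSE of the CNNGWP reimplementation versus the published value.
mse_gap = relative_gap(63.04, 62.34)
print(f"relative MSE gap: {mse_gap:.2%}")

# Hypothetical per-drug AUCs (illustration only, not Table 4 values).
auc_gap = sum_auc_gap([0.95, 0.91, 0.88, 0.93], [0.96, 0.90, 0.89, 0.93])
print(f"sum-of-AUC gap: {auc_gap:.3f}")
```

A sum over drugs is a compact summary, but it can hide offsetting per-drug differences, so per-drug values are still worth inspecting.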

Construction of TB-DROP deep learning models
As we switched to using whole-genome mutations as input, the number of layer weights was large. Therefore, the model architecture and hyperparameter tuning strategy adopted were intended to take the published models as the starting point and then tune them further according to the problems encountered. The representative architectures and characteristics of the four models are presented in Fig. 5. Putting the models side by side makes it easier to capture the core features of each model and the essential differences between them. More details regarding hyperparameter tuning are presented in Additional file 1.
The goal of the original design of WDNN was to enable the final classification layer to learn simultaneously (1) directly from the input data (the wide part) and (2) from the highly abstract features refined through multiple neural layers (the deep part) (Fig. 5). However, when the input data changed from mutations residing in drug resistance-related genes to mutations in the whole genome, the number of mutations increased considerably. Most of the newly added mutations are presumably not related to drug resistance. Therefore, model training faced more computational pressure, and at the same time, it was challenging for the final output layer to learn the mapping between whole genome mutations and drug resistance phenotypes. DeepAMR consisted of two parts: (1) a denoising autoencoder, whose input and output were identical, was trained first; (2) the highly compressed, dimension-reduced encoded features obtained from part one were fully connected to the output layer to predict the drug resistance of MTB (Fig. 5). The main features of the CNN-based model were the convolution and pooling layers (Fig. 5). A convolution layer can learn the interactions between mutations and requires less computation than a fully connected layer. A pooling layer can increase the model's resistance to noise and save computation. The model used in TB-DROP, an MLP (Fig. 5), was finally constructed based on the observations and summaries made while adjusting the three published models. As analyzed above, we found that the wide part in the original design of WDNN was no longer suitable for our scenario. Therefore, the wide part was removed from the architecture of WDNN, and the model became a traditional MLP model. The performance of the model did not change, indicating that the wide part of WDNN contributed negligibly.
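The relationship between the wide and deep architecture and the MLP can be sketched with a single-output forward pass. This is a simplified illustration, not the TB-DROP model: it uses one hidden layer and one drug, and the weights, inputs, and function names are hypothetical.

```python
import math


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, hidden_w, hidden_b, out_w, out_b, wide_w=None):
    """Resistance probability for one drug.

    With wide_w, the output layer sees both the hidden features and the
    raw input (wide and deep); without it, the model is a plain MLP.
    """
    hidden = relu(dense(x, hidden_w, hidden_b))
    logit = sum(w * h for w, h in zip(out_w, hidden)) + out_b
    if wide_w is not None:
        logit += sum(w * xi for w, xi in zip(wide_w, x))
    return sigmoid(logit)


# Hypothetical binary mutation vector and weights (illustration only).
x = [1, 0, 1]
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
out_w, out_b = [1.0, -1.0], -0.6

mlp_prob = predict(x, hidden_w, hidden_b, out_w, out_b)
wide_prob = predict(x, hidden_w, hidden_b, out_w, out_b, wide_w=[0.0, 0.0, 0.0])
print(mlp_prob, wide_prob)
```

The wide part adds one output weight per input feature, so with 620,169 variants the final layer must fit a very large number of direct connections, most of which are attached to mutations unrelated to resistance.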

Conclusion
The deep learning training framework developed in this study contributes substantially to an in-depth understanding of the characteristics of the models, as well as to the standardization and optimization of the training process. The three representative models were summarized and benchmarked systematically and comprehensively using this framework to identify the strengths and weaknesses of their modules, which provides a reliable basis for researchers who aim to develop more efficient deep learning-based models. The de novo MTB drug resistance prediction tool TB-DROP was developed to overcome the previous limitations regarding novel mutations, second-line drugs, and rarely reported drug-resistance genes. The small variance across tenfold cross-validation reflected stable performance and, together with the convergent loss curve, indicated that the model was well trained. Together, these results support the reliability of the model provided in TB-DROP. The development of TB-DROP lowers the barriers for clinicians in applying deep learning models and lays the foundation for the future clinical application of highly efficient models.

Fig. 1
Fig. 1 The representative architectures of four models, including TB-DROP. The upper-left panel is the model architecture of WDNN, which comprises two parts: a wide part and a deep part. The model architecture implemented in TB-DROP (upper-right) is a deep neural network (DNN). The bottom-left panel is the model architecture of DeepAMR, which comprises encoder, decoder, and output layers. The model architecture of CNNGWP in the bottom-right panel is a classic convolutional neural network, which consists of a convolutional layer and a pooling layer.

Fig. 2
Fig. 2 Summary of drug resistance status of all isolates

Table 1
Metrics of the four modified main neural network models. Abbreviations: AUC, area under the receiver operating characteristic curve; Var, variance of the values across tenfold cross-validation; tp, true positive; tn, true negative; fp, false positive; fn, false negative; NPV, negative predictive value. Positive samples are drug-resistant MTB; negative samples are drug-susceptible MTB. Sensitivity = tp/(tp + fn); specificity = tn/(tn + fp); precision = tp/(tp + fp); NPV = tn/(tn + fn). Bold values indicate the highest performance among the four models. The values presented are averages over tenfold cross-validation; the values in parentheses are the corresponding variances.
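The metric definitions in this caption can be written out as a short sketch. The fold counts below are hypothetical and are not data from the study, and the use of the population variance (`statistics.pvariance`) is an assumption, since the caption does not state which variance estimator was used.

```python
from statistics import mean, pvariance


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Per-fold metrics from confusion-matrix counts (positive = drug resistant)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical (tp, tn, fp, fn) counts for three folds, for illustration only.
folds = [(90, 170, 30, 10), (88, 175, 25, 12), (92, 172, 28, 8)]
per_fold = [confusion_metrics(*counts) for counts in folds]

# Mean and variance of each metric across folds, as reported in the table layout.
summary = {
    name: (
        mean([m[name] for m in per_fold]),
        pvariance([m[name] for m in per_fold]),
    )
    for name in per_fold[0]
}

for name, (avg, var) in summary.items():
    print(f"{name}: {avg:.3f} ({var:.5f})")
```

With ten real folds in place of the three hypothetical ones, each printed line would correspond to one table cell: the averaged metric followed by its variance in parentheses.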

Table 3
Summary of the four main neural network models. Abbreviations: MAF, Minor Allele Frequency; CNN, Convolutional Neural Network; KFold, sklearn.model_selection.KFold; CV, Cross Validation; MSSS, Python package, iterstrat.ml_stratifiers.MultilabelStratifiedShuffleSplit; tpr, true positive rate; tnr, true negative rate. a Whether usable software is provided

Table 4
The difference in metrics between the original models and the reimplemented versions. The column 'difference' is calculated as (value of reimplementation - value of original) / value of original. AUC stands for "Area under the ROC Curve"; MSE stands for Mean Squared Error
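The 'difference' column follows directly from the formula in this caption. A minimal sketch is given below; the function name and the example values are hypothetical and are not results from the study.

```python
def relative_difference(reimplemented: float, original: float) -> float:
    """(value of reimplementation - value of original) / value of original."""
    return (reimplemented - original) / original


# Hypothetical metric values, for illustration only.
print(f"{relative_difference(0.95, 0.97):+.4f}")  # reimplementation lower than original
print(f"{relative_difference(0.12, 0.10):+.4f}")  # reimplementation higher than original
```

The sign must be read per metric: a negative difference means a lower AUC (worse) but a lower MSE (better).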